Syllables and other String Kernel Extensions
Authors
Abstract
In recent years, string kernels that compare documents have been shown to achieve good results on text classification problems. In this paper we introduce the application of the string kernel in conjunction with syllables. Using syllables shortens the representation of documents compared to a character-based representation and, as a result, reduces computation time. Moreover, syllables provide a more natural representation of text than either the traditional coarse representation given by the bag-of-words or the overly fine one that results from considering individual letters only. We give some experimental results which show that syllables can be used effectively in text-categorisation problems. In this paper we also propose two extensions to the string kernel. The first introduces a lambda-weighting scheme, where different symbols can be given differing decay weightings. This may be useful in text and other applications where the insertion of certain symbols may be known to be less significant. We also introduce the concept of ‘soft matching’, where symbols can match (possibly weighted by relevance) even if they are not identical. Again, this provides a method of incorporating prior knowledge where certain symbols can be regarded as a partial or exact match and contribute to the overall similarity measure for two data items. We also give an overview of other kernels for comparing strings that have evolved over the past few years. In the appendix we give a detailed description of a method for computing the string kernel efficiently.

This text is mainly an extension of [STST02], written by Craig Saunders and John Shawe-Taylor together with the author. Sections describing other string kernels and the efficient computation of the string subsequence kernel have been added, a few passages altered to fit the new context, a few errors in the underlying paper corrected, and some pointers to recently published literature added.
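As a rough illustration of the ideas in the abstract, the following is a minimal brute-force sketch of a gap-weighted subsequence kernel that accepts any sequence of symbols (characters or syllables) and an optional soft-matching function. It is not the efficient method described in the paper's appendix, and it uses a single decay factor rather than the symbol-specific lambda weighting. The names `subsequence_kernel`, `exact`, and `sim` are hypothetical, and treating a soft match as a product of per-symbol similarities is an assumption made here for illustration.

```python
from itertools import combinations
from math import prod


def exact(a, b):
    """Hard matching: symbols contribute only when they are identical."""
    return 1.0 if a == b else 0.0


def subsequence_kernel(s, t, n, lam, sim=exact):
    """Brute-force gap-weighted subsequence kernel of order n.

    s, t : sequences of symbols (characters, syllables, ...)
    lam  : single decay factor in (0, 1]; longer spans are penalised more
    sim  : symbol similarity in [0, 1] (assumed form of soft matching)

    Enumerates every pair of length-n index tuples, so it is only
    suitable for very short sequences.
    """
    total = 0.0
    for i in combinations(range(len(s)), n):
        for j in combinations(range(len(t)), n):
            # Product of per-symbol similarities; zero if any pair mismatches.
            weight = prod(sim(s[a], t[b]) for a, b in zip(i, j))
            if weight:
                # Span covered in each sequence, including gaps.
                span = (i[-1] - i[0] + 1) + (j[-1] - j[0] + 1)
                total += weight * lam ** span
    return total


# Character-level and syllable-level inputs use the same function.
k_chars = subsequence_kernel("cat", "car", n=2, lam=0.5)
k_sylls = subsequence_kernel(["ker", "nel"], ["ker", "nel", "s"], n=2, lam=0.5)
```

Because a syllable sequence is shorter than the corresponding character sequence, the number of index tuples to compare shrinks accordingly, which is the source of the computational saving the abstract mentions.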
Similar resources
Syllables and other String Kernel
Recently, the use of string kernels that compare documents as a string of letters has been shown to achieve good results on text classification problems. In this paper we introduce the application of the string kernel in conjunction with syllables. Using syllables shortens the representation of documents and as a result reduces computation time. Moreover syllables provide a more natural represen...
Some new extensions of Hardy's inequality
In this study, using a non-negative homogeneous kernel k, we prove some extensions of Hardy's inequality in two and three dimensions.
Kernel Integrity Protection from Untrusted Extensions Using Dynamic Binary Instrumentation
Device drivers are the major source of concern for maintaining security and reliability of an operating system. Many of these device drivers, developed by third parties, get installed in kernel address space as extensions. These extensions are implicitly trusted and are allowed to interact with each other and kernel through well-defined interfaces and by sharing data in an uncontrolled manner. ...
The Spectrum Kernel: A String Kernel for SVM Protein Classification
We introduce a new sequence-similarity kernel, the spectrum kernel, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. Our kernel is conceptually simple and efficient to compute and, in experiments on the SCOP database, performs well in comparison with state-of-the-art methods for homology detection. Moreover, our method produces an S...
Position-Aware String Kernels with Weighted Shifts and a General Framework to Apply String Kernels to Other Structured Data
In combination with efficient kernel-based learning machines such as the Support Vector Machine (SVM), string kernels have proven to be significantly effective in a wide range of research areas (e.g. bioinformatics, text analysis, voice analysis). Many of the string kernels proposed so far take advantage of simpler kernels such as trivial comparison of characters and/or substrings, and are classifie...